Conversation
sleeepyjack
left a comment
There was a problem hiding this comment.
Looks good. Shall I run some benchmarks comparing perf on H100?
Please do. I expect a maximum difference of 0.5% to 1%. |
|
🙅 Something's not right: |
|
About 50% slower for |
sleeepyjack
left a comment
There was a problem hiding this comment.
Not sure if the perf regression is worth investigating. If we can create a minimal repro we can ping our CG devs. I'm also ok with shelving the PR for now.
|
I cannot reproduce the performance regression with my local RTX8000: Could this issue be H100-specific? |
|
That's expected on a <sm_90 arch. From sm_90 going forward the function will use a different code path leveraging the new |
|
Update: I ran the same exact benchmarks on another H100 node (CTK 12.3) today and wasn't able to reproduce the regression I initially reported. Will run another test on a different node and report back. |
|
@sleeepyjack Any updates on H100 perf results? |
| auto const tile = cg::tiled_partition<CGSize>(block); | ||
| auto const found = ref.find(tile, key); | ||
|
|
||
| #if defined(CUCO_HAS_CG_INVOKE_ONE) | ||
| cg::invoke_one(tile, [&]() { | ||
| *(output_begin + idx) = found == ref.end() ? ref.empty_key_sentinel() : *found; | ||
| }); | ||
| #else | ||
| if (tile.thread_rank() == 0) { | ||
| *(output_begin + idx) = found == ref.end() ? ref.empty_key_sentinel() : *found; | ||
| } | ||
| #endif |
There was a problem hiding this comment.
For context: This is the section I've been focussing on in my investigation.
|
I tested another H100 HBM3 node with CTK 12.3 and the performance regression is still present although much less pronounced (around 5-6% less throughput). The instruction diff between the baseline and this version is straightforward: if (tile.thread_rank() == 0) {...}compiles to which sets The Compared to the baseline, the new version additionally computes the mask of participating threads ( I have some profiles which I will share via Slack since I had to use an internal toolchain. |
This PR updates new open addressing implementations to use
cg::invoke_onewhen possible.It doesn't change legacy implementations like multimap or dynamic map, etc.